Introduction

The purpose of this task is to develop various forecasting techniques for our term project. First, each product's series will be decomposed to analyze its seasonality. Unfortunately, due to lack of data, only daily decomposition can be applied: Trendyol does not share data from before May 2020, so weekly and monthly decompositions cannot be made. Based on the decomposition of each product, suitable ARIMA models will be built. Then, for each product, possible regressors will be analyzed according to their relation with the sold count, and ARIMAX models will be built with the selected regressors to obtain better forecasts.

Required libraries and data manipulations

library(data.table)
library(caTools)
library(xts)
library(zoo)
library(forecast)
library(urca)
library(ggplot2)


# `data` (the up-to-date observations loaded earlier in the project) gets a
# discount-day flag column before being bound to the raw data.
data[, is.discount_days := 0]
rawdata <- read.csv("/Users/tolgaerdogan/Documents/GitHub/spring21-TolgaErdogann/Data/ProjectRawData.csv", header=TRUE, sep = ";")
rawdata$event_date <- as.Date(rawdata$event_date , format = "%d.%m.%Y")
alldata<-rbind(rawdata,data)
alldata<-as.data.table(alldata)
alldata$is.discount_days[alldata$event_date %in%
  seq(as.Date("2021-03-08"), as.Date("2021-03-12"), by = "day")] <- 1

# Some of the other discount days are set to 1 in ProjectRawData.csv

n<-400
dates<- seq(as.Date("2020-05-25"), length = n, by = "days")
#validation_dates<-seq(as.Date("2021-03-01"), length = n-280, by = "days")

Xiaomi Bluetooth Earphone

Decomposition

earphone<-subset(alldata, alldata$product_content_id==6676673)
earphone_daily_tr<-earphone[earphone$event_date<"2021-06-21"]
earphone_daily_te<-earphone[earphone$event_date>="2021-06-21"]
plot(earphone_daily_tr$sold_count, type ='l')

The plot above shows the sold count of the Xiaomi Bluetooth Earphone. There is no visible trend over time and the variance seems constant, although there are outlier points.

earphone_ts_daily<-ts(earphone_daily_tr$sold_count, freq=7)
tsdisplay(earphone_ts_daily)

The series is converted into a ts object with frequency 7 so that the weekly pattern in the daily data can be modeled. The ACF plot shows significant correlations decaying up to lag 8, and the PACF shows a significant correlation only at lag 1. The series will be examined further after decomposition.

earphone_daily_additive<-decompose(earphone_ts_daily, type ="additive")
earphone_daily_multiplicative<-decompose(earphone_ts_daily, type ="multiplicative")
plot(earphone_daily_additive)

plot(earphone_daily_multiplicative)

The random part of the additive decomposition has less variance than that of the multiplicative approach, and the variance of the multiplicative random part changes over time. In both cases the random part fluctuates around a constant level (zero for additive, one for multiplicative).

To decide which decomposition is better, a test that measures the stationarity of the random part can be applied, and the method to be used in the rest of the study chosen accordingly.

test11=ur.kpss(earphone_daily_additive$random, use.lag="7")
summary(test11)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0112 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
test12=ur.kpss(earphone_daily_multiplicative$random, use.lag="7")
summary(test12)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0336 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The additive decomposition gives a more stationary random part than the multiplicative one, according to the KPSS test statistics and critical values. Hence, the additive decomposition is chosen for further examination.
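Since the same KPSS comparison is repeated for every product, it can be wrapped in a small helper. The sketch below is illustrative only (the function name is our own): it runs the same test on both candidate random parts and reports the decomposition whose statistic is smaller, i.e. whose random part looks more stationary.

```r
library(urca)

# Illustrative helper (hypothetical name): compare the KPSS statistics of two
# candidate random parts and keep the more stationary decomposition.
choose_decomposition <- function(add_random, mult_random, lags = "7") {
  stat_add  <- ur.kpss(na.omit(add_random),  use.lag = lags)@teststat
  stat_mult <- ur.kpss(na.omit(mult_random), use.lag = lags)@teststat
  if (stat_add <= stat_mult) "additive" else "multiplicative"
}
```

For the earphone, `choose_decomposition(earphone_daily_additive$random, earphone_daily_multiplicative$random)` would reproduce the choice made above.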

plot(earphone_daily_additive$seasonal[1:30], type='l')

plot(earphone_daily_additive$trend, type='l')

The plots above give a closer look at the trend and seasonal components of the additive decomposition.

Since there are not enough data points, decomposition cannot be made at other levels. Although we cannot decompose them, weekly and monthly plots can be seen below.

earphone_xts<-xts(earphone_daily_tr$sold_count, order.by=as.Date(earphone_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(earphone_xts,on="weeks")
earphone_xts_weekly<-period.apply(earphone_xts,INDEX=weekly,FUN=mean)
plot(earphone_xts_weekly)

monthly<-endpoints(earphone_xts,on="months")
earphone_xts_monthly<-period.apply(earphone_xts,INDEX=monthly,FUN=mean)
plot(earphone_xts_monthly)

The weekly and monthly plots do not show any significant pattern or trend. With more data points, a more meaningful result could perhaps be obtained from these graphs.

ARIMA Model

tsdisplay(earphone_daily_additive$random, na.action = na.omit)

According to the ACF plot, there is a significant negative correlation at lag 3, so the moving-average order is taken as 3. The PACF plot shows that the correlations drop off after lag 3, which suggests an AR order of 3.

earphone_model<-arima(earphone_daily_additive$random, order = c(3,0,3))
AIC(earphone_model)
## [1] 4641.866
earphone_model2<-arima(earphone_daily_additive$random, order = c(3,0,2))
AIC(earphone_model2)
## [1] 4639.877
earphone_model3<-arima(earphone_daily_additive$random, order = c(2,0,2))
AIC(earphone_model3)
## [1] 4643.746

When adjacent orders are also checked, model 2 with orders (3,0,2) is the best in terms of AIC, therefore it is chosen.
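The manual comparison above can be automated by scanning a small grid of candidate orders and keeping the fit with the lowest AIC. The helper below is a sketch under our own assumptions (the function name and grid bounds are illustrative, not part of the project code):

```r
# Illustrative sketch: fit ARMA(p, 0, q) over a small grid of orders and
# return the order with the lowest AIC; failed fits are skipped.
select_arma_order <- function(x, max_p = 3, max_q = 3) {
  best_aic <- Inf
  best_order <- c(0, 0, 0)
  for (p in 0:max_p) {
    for (q in 0:max_q) {
      fit <- try(arima(x, order = c(p, 0, q)), silent = TRUE)
      if (!inherits(fit, "try-error") && AIC(fit) < best_aic) {
        best_aic <- AIC(fit)
        best_order <- c(p, 0, q)
      }
    }
  }
  best_order
}
```

For example, `select_arma_order(na.omit(earphone_daily_additive$random))` would search the same neighborhood as the manual comparison; `forecast::auto.arima()` offers a more sophisticated built-in search.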

Fitting the model

# Recompose the forecast: random-part predictions + matching seasonal values
# + the last observed trend value (the trend is assumed to stay constant).
model_forecast_error <- predict(earphone_model2, n.ahead = 11)$pred
last_trend_value <- tail(earphone_daily_additive$trend[!is.na(earphone_daily_additive$trend)], 1)
seasonality <- earphone_daily_additive$seasonal[295:305]  # seasonal values aligned with the 11 forecast days
earphone_arima_forecast <- model_forecast_error + seasonality + last_trend_value

earphone_daily_te$forecast <- as.numeric(earphone_arima_forecast)  # drop the ts attributes


ggplot(earphone_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above compares 11 days of actual values with predictions from the ARIMA model built on the decomposed random part. The ARIMA(3,0,2) model cannot explain why the earphone has barely sold lately.

error_test(earphone_daily_te$sold_count,earphone_daily_te$forecast)
##    n     mean       sd       bias      mape      mad     wmape        MPE
## 1 11 353.6364 106.4878 -0.7109682 0.8337796 251.4242 0.7109682 -0.8337796

WMAPE is 71%, which is far too high. The ARIMA model clearly overpredicts the sales volume, mainly because of the assumption that the trend stays constant at its last observed value.
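`error_test()` is a helper defined elsewhere in the project. A minimal sketch consistent with the columns reported above (the exact definitions are our assumption) could be:

```r
# Assumed reimplementation of the project's error_test() helper: summarizes
# forecast accuracy with bias, MAPE, MAD, WMAPE and MPE.
error_test <- function(actual, forecast) {
  error <- actual - forecast
  data.frame(
    n     = length(actual),
    mean  = mean(actual),
    sd    = sd(actual),
    bias  = sum(error) / sum(actual),        # signed total error, scaled by total sales
    mape  = mean(abs(error / actual)),       # mean absolute percentage error
    mad   = mean(abs(error)),                # mean absolute deviation
    wmape = sum(abs(error)) / sum(actual),   # weighted MAPE
    MPE   = mean(error / actual)             # mean percentage error
  )
}
```

With these definitions, bias equals -wmape and MPE equals -mape whenever the model overpredicts on every day, which matches the earphone ARIMA output above.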

ARIMAX Model

earphone_xreg1<-cbind(earphone_daily_tr$favored_count,
                  earphone_daily_tr$basket_count,
                  earphone_daily_tr$visit_count)
earphone_xreg2<-cbind(earphone_daily_te$favored_count,
                  earphone_daily_te$basket_count,
                  earphone_daily_te$visit_count)

earphone_arimax<-Arima(earphone_ts_daily,xreg=as.matrix(earphone_xreg1),order=c(3,0,2))
earphone_daily_te$forecastarimax<-forecast(earphone_arimax,xreg=as.matrix(earphone_xreg2))$mean
error_test(earphone_daily_te$sold_count,earphone_daily_te$forecastarimax)
##    n     mean       sd       bias      mape      mad     wmape        MPE
## 1 11 353.6364 106.4878 -0.1613391 0.2019397 57.68401 0.1631167 -0.2008867

WMAPE is 0.16, which is slightly higher than that of the multiple linear regression model with the same regressors. The plot below shows the ARIMAX forecasts against the actual sales.

ggplot(earphone_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

There is a large difference between the ARIMA and ARIMAX forecasts. As can be seen from the plot above, the regressors improved the model considerably.

Fakir Vacuum Cleaner

Decomposition

vacuum<-subset(alldata, alldata$product_content_id==7061886)
vacuum_daily_tr<-vacuum[vacuum$event_date<"2021-06-21"]
vacuum_daily_te<-vacuum[vacuum$event_date>="2021-06-21"]
plot(vacuum_daily_tr$sold_count, type ='l')

The daily sold count of the Fakir vacuum cleaner is plotted above. The sales of this item show clearly visible outlier points.

vacuum_ts_daily<-ts(vacuum_daily_tr$sold_count, freq=7)
tsdisplay(vacuum_ts_daily)

The series is converted into a daily ts object with frequency 7. The ACF plot suggests seasonality, and the PACF shows a significant correlation only at lag 1.

vacuum_daily_additive<-decompose(vacuum_ts_daily, type ="additive")
vacuum_daily_multiplicative<-decompose(vacuum_ts_daily, type ="multiplicative")
plot(vacuum_daily_additive)

plot(vacuum_daily_multiplicative)

The random part of the additive decomposition yields a better result in terms of variance.

To decide which decomposition is better, a test that measures the stationarity of the random part can be applied, and the method to be used in the rest of the study chosen accordingly.

test13=ur.kpss(vacuum_daily_additive$random, use.lag="7")
summary(test13)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0111 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
test14=ur.kpss(vacuum_daily_multiplicative$random, use.lag="7")
summary(test14)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.1153 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

As expected from the random-part graphs, the additive decomposition gives a more stationary random part.

plot(vacuum_daily_additive$seasonal[1:30], type='l')

plot(vacuum_daily_additive$trend, type='l')

The plots above give a closer look at the trend and seasonal components of the additive decomposition.

Since there are not enough data points, decomposition cannot be made at other levels. Although we cannot decompose them, weekly and monthly plots can be seen below.

vacuum_xts<-xts(vacuum_daily_tr$sold_count, order.by=as.Date(vacuum_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(vacuum_xts,on="weeks")
vacuum_xts_weekly<-period.apply(vacuum_xts,INDEX=weekly,FUN=mean)
plot(vacuum_xts_weekly)

The peaks in November coincide with the Trendyol discount days and Black Friday, which we flagged in our project.

monthly<-endpoints(vacuum_xts,on="months")
vacuum_xts_monthly<-period.apply(vacuum_xts,INDEX=monthly,FUN=mean)
plot(vacuum_xts_monthly)

In the monthly sales data, November has the highest sales, which is expected due to price discounts and advertisements.

ARIMA Model

tsdisplay(vacuum_daily_additive$random, na.action = na.omit)

According to the ACF plot, there is a significant negative correlation at lag 3, so the moving-average order is taken as 3. The PACF plot shows that the correlations drop off after lag 1, which suggests an AR order of 1.

vacuum_model<-arima(vacuum_daily_additive$random, order = c(1,0,3))
AIC(vacuum_model)
## [1] 3268.003
vacuum_model2<-arima(vacuum_daily_additive$random, order = c(1,0,2))
AIC(vacuum_model2)
## [1] 3375.757
vacuum_model3<-arima(vacuum_daily_additive$random, order = c(2,0,2))
AIC(vacuum_model3)
## [1] 3354.103

When adjacent orders are also checked, model 1 with orders (1,0,3) is still the best in terms of AIC, therefore it is chosen.

Fitting the model

model_forecast_error <- predict(vacuum_model, n.ahead = 11)$pred
last_trend_value <-tail(vacuum_daily_additive$trend[!is.na(vacuum_daily_additive$trend)],1)
seasonality=vacuum_daily_additive$seasonal[295:305]
vacuum_arima_forecast=model_forecast_error+seasonality+last_trend_value

vacuum_daily_te$forecast <- as.numeric(vacuum_arima_forecast)  # drop the ts attributes


ggplot(vacuum_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above compares 11 days of actual values with predictions from the ARIMA model built on the decomposed random part. Although the model catches the ups and downs in sales, it does not track their magnitude well.

error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecast)
##    n     mean       sd       bias     mape      mad     wmape        MPE
## 1 11 12.45455 6.990903 -0.2023177 0.904415 6.506546 0.5224234 -0.6808567

WMAPE is 52%, which is high. Although the model caught the seasonality, the constant-trend assumption increased the errors.

Regressor Search

vacuum_lm1<-lm(sold_count~visit_count+is.discount_days, data=vacuum_daily_tr)
summary(vacuum_lm1)
## 
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = vacuum_daily_tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -128.701  -15.605   -5.764    6.033  158.580 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       36.967454   1.918260  19.271  < 2e-16 ***
## visit_count       -0.003763   0.001448  -2.599  0.00969 ** 
## is.discount_days 113.452407   8.294838  13.677  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.5 on 389 degrees of freedom
## Multiple R-squared:  0.333,  Adjusted R-squared:  0.3295 
## F-statistic: 97.09 on 2 and 389 DF,  p-value: < 2.2e-16
vacuum_lm2<-lm(sold_count~category_sold+price, data=vacuum_daily_tr)
summary(vacuum_lm2)
## 
## Call:
## lm(formula = sold_count ~ category_sold + price, data = vacuum_daily_tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -115.151  -15.039    2.828   10.488  186.436 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   207.969490  16.240448   12.81   <2e-16 ***
## category_sold   0.103703   0.007536   13.76   <2e-16 ***
## price          -0.725780   0.062501  -11.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.69 on 389 degrees of freedom
## Multiple R-squared:  0.4075, Adjusted R-squared:  0.4044 
## F-statistic: 133.7 on 2 and 389 DF,  p-value: < 2.2e-16
vacuum_lm3<-lm(sold_count~price+favored_count+category_sold, data=vacuum_daily_tr)
summary(vacuum_lm3)
## 
## Call:
## lm(formula = sold_count ~ price + favored_count + category_sold, 
##     data = vacuum_daily_tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -119.934  -15.629    3.049   11.286  186.813 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   257.599267  19.818724  12.998  < 2e-16 ***
## price          -0.941892   0.079986 -11.776  < 2e-16 ***
## favored_count   0.064816   0.015443   4.197 3.36e-05 ***
## category_sold   0.106979   0.007421  14.415  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.08 on 388 degrees of freedom
## Multiple R-squared:  0.4332, Adjusted R-squared:  0.4288 
## F-statistic: 98.84 on 3 and 388 DF,  p-value: < 2.2e-16
vacuum_lm4<-lm(sold_count~favored_count+basket_count+visit_count, data=vacuum_daily_tr)
summary(vacuum_lm4)
## 
## Call:
## lm(formula = sold_count ~ favored_count + basket_count + visit_count, 
##     data = vacuum_daily_tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -139.248   -4.510   -0.059    4.581  141.389 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.298130   1.736934  -0.172 0.863809    
## favored_count -0.038383   0.010279  -3.734 0.000217 ***
## basket_count   0.294578   0.008380  35.154  < 2e-16 ***
## visit_count    0.001147   0.001166   0.984 0.325961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.69 on 388 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7641 
## F-statistic: 423.2 on 3 and 388 DF,  p-value: < 2.2e-16
vacuum_daily_te$forecastlm1<-predict(vacuum_lm1,vacuum_daily_te)
vacuum_daily_te$forecastlm2<-predict(vacuum_lm2,vacuum_daily_te)
vacuum_daily_te$forecastlm3<-predict(vacuum_lm3,vacuum_daily_te)
vacuum_daily_te$forecastlm4<-predict(vacuum_lm4,vacuum_daily_te)

error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecastlm1)
##    n     mean       sd      bias     mape     mad    wmape       MPE
## 1 11 12.45455 6.990903 -1.717622 3.400904 21.3922 1.717622 -3.400904
error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecastlm2)
##    n     mean       sd      bias     mape      mad    wmape       MPE
## 1 11 12.45455 6.990903 -4.406717 5.271331 54.88365 4.406717 -5.271331
error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecastlm3)
##    n     mean       sd      bias     mape      mad    wmape       MPE
## 1 11 12.45455 6.990903 -3.684128 3.698435 45.88414 3.684128 -3.698435
error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecastlm4)
##    n     mean       sd      bias      mape      mad     wmape        MPE
## 1 11 12.45455 6.990903 0.1534633 0.2839136 2.947285 0.2366433 0.01318687

Four different multiple linear regression models are built to understand which regressors explain sales best. In terms of both adjusted R-squared and WMAPE, the 4th model explains the sales behavior better than the others. Therefore, the 4th model's regressors are chosen: favored count, basket count and visit count.
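The comparison above can also be done programmatically. The helper below is a hypothetical sketch (name and interface are our own) that ranks candidate lm fits by adjusted R-squared:

```r
# Hypothetical helper: rank candidate lm fits by adjusted R-squared,
# highest first; named arguments become the names of the result.
compare_adj_r2 <- function(...) {
  fits <- list(...)
  sort(vapply(fits, function(m) summary(m)$adj.r.squared, numeric(1)),
       decreasing = TRUE)
}
```

For example, `compare_adj_r2(lm1 = vacuum_lm1, lm2 = vacuum_lm2, lm3 = vacuum_lm3, lm4 = vacuum_lm4)` would put the 4th model first, in line with the conclusion above.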

ARIMAX Model

vacuum_xreg1<-cbind(vacuum_daily_tr$favored_count,
                  vacuum_daily_tr$basket_count,
                  vacuum_daily_tr$visit_count)
vacuum_xreg2<-cbind(vacuum_daily_te$favored_count,
                  vacuum_daily_te$basket_count,
                  vacuum_daily_te$visit_count)

vacuum_arimax<-Arima(vacuum_ts_daily,xreg=as.matrix(vacuum_xreg1),order=c(1,0,3))
vacuum_daily_te$forecastarimax<-forecast(vacuum_arimax,xreg=as.matrix(vacuum_xreg2))$mean
error_test(vacuum_daily_te$sold_count,vacuum_daily_te$forecastarimax)
##    n     mean       sd        bias      mape      mad     wmape       MPE
## 1 11 12.45455 6.990903 0.007015643 0.3925069 2.761589 0.2217334 -0.247751

WMAPE is 0.22, which is better than the 0.24 of the multiple linear regression model with the same regressors. The plot below shows the ARIMAX forecasts against the actual sales.

ggplot(vacuum_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

As can be seen from the plot, the actual and forecasted values are close to each other, and WMAPE is as low as desired. The vacuum cleaner sales count is explained well by favored count, basket count and visit count.

La Roche-Posay Facial Cleanser

Decomposition

facial_cleanser<-subset(alldata, alldata$product_content_id==85004)
facial_cleanser_daily_tr<-facial_cleanser[facial_cleanser$event_date<"2021-06-21"]
facial_cleanser_daily_te<-facial_cleanser[facial_cleanser$event_date>="2021-06-21"]
plot(facial_cleanser_daily_tr$sold_count, type ='l')

The plot above shows the daily sales of the La Roche-Posay facial cleanser. Sales of this item increased over time, and the variance grew during the fall and winter terms.

facial_cleanser_ts_daily<-ts(facial_cleanser_daily_tr$sold_count, freq=7)
tsdisplay(facial_cleanser_ts_daily)

To capture the weekly effect in the data, a time series with frequency 7 is created. The ACF plot suggests seasonality, and the PACF shows a significant correlation only at lag 1.

facial_cleanser_daily_additive<-decompose(facial_cleanser_ts_daily, type ="additive")
facial_cleanser_daily_multiplicative<-decompose(facial_cleanser_ts_daily, type ="multiplicative")
plot(facial_cleanser_daily_additive)

plot(facial_cleanser_daily_multiplicative)

The random part of the additive decomposition has less variance, but it is not constant over time; the random part of the multiplicative decomposition has more variance, but it is constant over time. It is hard to make a definitive statement just by looking at the charts, so stationarity tests are applied.

test15=ur.kpss(facial_cleanser_daily_additive$random, use.lag="7")
summary(test15)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0092 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
test16=ur.kpss(facial_cleanser_daily_multiplicative$random, use.lag="7")
summary(test16)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.1831 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The additive decomposition should be chosen, since its random part is more stationary according to the test results.

plot(facial_cleanser_daily_additive$seasonal[1:30], type='l')

plot(facial_cleanser_daily_additive$trend, type='l')

The plots above give a closer look at the trend and seasonal components of the additive decomposition.

Since there are not enough data points, decomposition cannot be made at other levels. Although we cannot decompose them, weekly and monthly plots can be seen below.

facial_cleanser_xts<-xts(facial_cleanser_daily_tr$sold_count, order.by=as.Date(facial_cleanser_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(facial_cleanser_xts,on="weeks")
facial_cleanser_xts_weekly<-period.apply(facial_cleanser_xts,INDEX=weekly,FUN=mean)
plot(facial_cleanser_xts_weekly)

monthly<-endpoints(facial_cleanser_xts,on="months")
facial_cleanser_xts_monthly<-period.apply(facial_cleanser_xts,INDEX=monthly,FUN=mean)
plot(facial_cleanser_xts_monthly)

Sales in November are affected by the discount days, which we flagged in our project to obtain better forecasts.

ARIMA Model

tsdisplay(facial_cleanser_daily_additive$random, na.action = na.omit)

According to the ACF plot, there is a significant negative correlation at lag 3, so the moving-average order is taken as 3. The PACF plot shows that the correlations drop off after lag 1, which suggests an AR order of 1.

facial_cleanser_model<-arima(facial_cleanser_daily_additive$random, order = c(1,0,3))
AIC(facial_cleanser_model)
## [1] 3644.424
facial_cleanser_model2<-arima(facial_cleanser_daily_additive$random, order = c(1,0,2))
AIC(facial_cleanser_model2)
## [1] 3753.453
facial_cleanser_model3<-arima(facial_cleanser_daily_additive$random, order = c(2,0,2))
AIC(facial_cleanser_model3)
## [1] 3729.383

When adjacent orders are also checked, model 1 with orders (1,0,3) is still the best in terms of AIC, therefore it is chosen.

Fitting the model

model_forecast_error <- predict(facial_cleanser_model, n.ahead = 11)$pred
last_trend_value <-tail(facial_cleanser_daily_additive$trend[!is.na(facial_cleanser_daily_additive$trend)],1)
seasonality=facial_cleanser_daily_additive$seasonal[295:305]
facial_cleanser_arima_forecast=model_forecast_error+seasonality+last_trend_value

facial_cleanser_daily_te$forecast <- as.numeric(facial_cleanser_arima_forecast)  # drop the ts attributes


ggplot(facial_cleanser_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above compares 11 days of actual values with predictions from the ARIMA model built on the decomposed random part. While the actual sales swing sharply, the predicted sales change smoothly from day to day. WMAPE should also be checked.

error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecast)
##    n     mean       sd        bias      mape      mad    wmape        MPE
## 1 11 71.54545 21.59335 -0.04346981 0.2029299 13.45183 0.188018 -0.1021247

The WMAPE value is 19%, which is not bad. To obtain a better model, appropriate regressors should be investigated.

Regressor Search

facial_cleanser_lm1<-lm(sold_count~visit_count+is.discount_days, data=facial_cleanser_daily_tr)
summary(facial_cleanser_lm1)
## 
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = facial_cleanser_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -165.19  -24.30  -12.92    8.25  342.08 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.492e+01  3.087e+00  17.790  < 2e-16 ***
## visit_count      8.726e-03  1.153e-03   7.571 2.71e-13 ***
## is.discount_days 1.645e+02  1.303e+01  12.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.47 on 389 degrees of freedom
## Multiple R-squared:  0.3605, Adjusted R-squared:  0.3572 
## F-statistic: 109.6 on 2 and 389 DF,  p-value: < 2.2e-16
facial_cleanser_lm2<-lm(sold_count~category_sold+price, data=facial_cleanser_daily_tr)
summary(facial_cleanser_lm2)
## 
## Call:
## lm(formula = sold_count ~ category_sold + price, data = facial_cleanser_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -153.96  -36.21  -15.54   20.80  297.59 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   358.941849  42.947829   8.358 1.15e-15 ***
## category_sold   0.024796   0.002796   8.869  < 2e-16 ***
## price          -3.855490   0.555368  -6.942 1.63e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.27 on 389 degrees of freedom
## Multiple R-squared:  0.2017, Adjusted R-squared:  0.1976 
## F-statistic: 49.14 on 2 and 389 DF,  p-value: < 2.2e-16
facial_cleanser_lm3<-lm(sold_count~price+favored_count+category_sold, data=facial_cleanser_daily_tr)
summary(facial_cleanser_lm3)
## 
## Call:
## lm(formula = sold_count ~ price + favored_count + category_sold, 
##     data = facial_cleanser_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -162.66  -24.63   -6.67   15.65  328.02 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   477.719639  37.749407  12.655  < 2e-16 ***
## price          -5.649418   0.493944 -11.437  < 2e-16 ***
## favored_count   0.060934   0.004961  12.283  < 2e-16 ***
## category_sold   0.018515   0.002430   7.619 1.97e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.96 on 388 degrees of freedom
## Multiple R-squared:  0.4252, Adjusted R-squared:  0.4207 
## F-statistic: 95.67 on 3 and 388 DF,  p-value: < 2.2e-16
facial_cleanser_lm4<-lm(sold_count~favored_count+basket_count+visit_count+is.discount_days+category_sold, data=facial_cleanser_daily_tr)
summary(facial_cleanser_lm4)
## 
## Call:
## lm(formula = sold_count ~ favored_count + basket_count + visit_count + 
##     is.discount_days + category_sold, data = facial_cleanser_daily_tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -141.291   -9.912    0.349    9.431  137.337 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.023e+01  2.270e+00  -4.507 8.74e-06 ***
## favored_count    -4.531e-03  3.528e-03  -1.284   0.1999    
## basket_count      2.934e-01  8.980e-03  32.673  < 2e-16 ***
## visit_count      -1.503e-02  8.611e-04 -17.458  < 2e-16 ***
## is.discount_days  5.102e+01  7.080e+00   7.206 3.06e-12 ***
## category_sold     3.477e-03  1.211e-03   2.872   0.0043 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.91 on 386 degrees of freedom
## Multiple R-squared:  0.8639, Adjusted R-squared:  0.8621 
## F-statistic: 489.9 on 5 and 386 DF,  p-value: < 2.2e-16
facial_cleanser_daily_te$forecastlm1<-predict(facial_cleanser_lm1,facial_cleanser_daily_te)
facial_cleanser_daily_te$forecastlm2<-predict(facial_cleanser_lm2,facial_cleanser_daily_te)
facial_cleanser_daily_te$forecastlm3<-predict(facial_cleanser_lm3,facial_cleanser_daily_te)
facial_cleanser_daily_te$forecastlm4<-predict(facial_cleanser_lm4,facial_cleanser_daily_te)

error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecastlm1)
##    n     mean       sd        bias      mape     mad     wmape        MPE
## 1 11 71.54545 21.59335 -0.09109016 0.2302715 14.5634 0.2035545 -0.1604838
error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecastlm2)
##    n     mean       sd       bias      mape      mad     wmape        MPE
## 1 11 71.54545 21.59335 -0.6463231 0.6857502 46.24148 0.6463231 -0.6857502
error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecastlm3)
##    n     mean       sd       bias     mape      mad     wmape        MPE
## 1 11 71.54545 21.59335 -0.1772046 0.227159 16.42899 0.2296301 -0.1677145
error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecastlm4)
##    n     mean       sd       bias      mape     mad     wmape        MPE
## 1 11 71.54545 21.59335 0.03935856 0.2557536 16.8197 0.2350911 0.05802043

Four different multiple linear regression models are built to understand which regressors explain sales best. In terms of adjusted R-squared, the 4th model explains the sales behavior better than the others. When WMAPE is checked, the 1st model has the minimum value among all models; nevertheless, the 1st and 4th models have close WMAPE values. Therefore, the 4th model's regressors are chosen: favored count, basket count, visit count, category sold and discount days.

ARIMAX Model

facial_cleanser_xreg1<-cbind(facial_cleanser_daily_tr$favored_count,
                  facial_cleanser_daily_tr$basket_count,
                  facial_cleanser_daily_tr$category_sold,
                  facial_cleanser_daily_tr$is.discount_days,
                  facial_cleanser_daily_tr$visit_count)
facial_cleanser_xreg2<-cbind(facial_cleanser_daily_te$favored_count,
                  facial_cleanser_daily_te$basket_count,
                  facial_cleanser_daily_te$category_sold,
                  facial_cleanser_daily_te$is.discount_days,
                  facial_cleanser_daily_te$visit_count)

facial_cleanser_arimax<-Arima(facial_cleanser_ts_daily,xreg=as.matrix(facial_cleanser_xreg1),order=c(1,0,3))
facial_cleanser_daily_te$forecastarimax<-forecast(facial_cleanser_arimax,xreg=as.matrix(facial_cleanser_xreg2))$mean
error_test(facial_cleanser_daily_te$sold_count,facial_cleanser_daily_te$forecastarimax)
##    n     mean       sd        bias      mape      mad     wmape         MPE
## 1 11 71.54545 21.59335 -0.07684045 0.2303227 15.23747 0.2129761 -0.06672833

WMAPE is 0.21, which is better than the multiple linear regression model with the same regressors (0.24). The plot below shows the ARIMAX forecasts against actual sales.

ggplot(facial_cleanser_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

As can be seen from the plot, actual and forecasted values are close to each other and WMAPE is low, as desired. Facial cleanser sales are explained well by favored count, basket count, category sold, visit count and discount days.

Sleepy Baby Wet Wipe

Decomposition

babywipe<-subset(alldata, alldata$product_content_id==4066298)
babywipe_daily_tr<-babywipe[babywipe$event_date<"2021-06-21"]
babywipe_daily_te<-babywipe[babywipe$event_date>="2021-06-21"]
plot(babywipe_daily_tr$sold_count, type ='l')

The chart above visualizes the data. Looking at the graph, it can be said that there are outlier points and seasonality is visible.

babywipe_ts_daily<-ts(babywipe_daily_tr$sold_count, freq=7)
tsdisplay(babywipe_ts_daily)

Necessary manipulations were made on the data to obtain a time series. The ACF plot shows significant correlation at lags 1, 2 and 3, which suggests that the data has a trend.

babywipe_daily_additive<-decompose(babywipe_ts_daily, type ="additive")
babywipe_daily_multiplicative<-decompose(babywipe_ts_daily, type ="multiplicative")
plot(babywipe_daily_additive)

plot(babywipe_daily_multiplicative)

The trend, seasonal and random components obtained from the additive and multiplicative decompositions are shown above. The zero-mean assumption looks satisfied for both random parts. Nevertheless, the additive random part has visibly less variance than the multiplicative one.

To decide which decomposition is more appropriate for the rest of the analysis, a stationarity test is applied to the random part of each.

test17=ur.kpss(babywipe_daily_additive$random, use.lag="7")
summary(test17)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0589 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
test18=ur.kpss(babywipe_daily_multiplicative$random, use.lag="7")
summary(test18)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0859 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic for both decomposition methods is smaller than the critical value, so we cannot reject the null hypothesis and conclude that the random part is stationary. However, the additive decomposition has the smaller test statistic, so it is used for further analysis.

plot(babywipe_daily_additive$seasonal[1:30], type='l')

plot(babywipe_daily_additive$trend, type='l')

Plots above show closer looks at the trend and seasonal components of the additive decomposition.

Since there are not enough data points, decomposition cannot be made at other levels. Although we cannot decompose, weekly and monthly plots can be seen below.

babywipe_xts<-xts(babywipe_daily_tr$sold_count, order.by=as.Date(babywipe_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(babywipe_xts,on="weeks")
babywipe_xts_weekly<-period.apply(babywipe_xts,INDEX=weekly,FUN=mean)
plot(babywipe_xts_weekly)

monthly<-endpoints(babywipe_xts,on="months")
babywipe_xts_monthly<-period.apply(babywipe_xts,INDEX=monthly,FUN=mean)
plot(babywipe_xts_monthly)

Sales in November are influenced by discount days which are specified to get a better prediction outcome.

ARIMA Model

tsdisplay(babywipe_daily_additive$random, na.action = na.omit)

According to the ACF plot, there is significant negative correlation at lag 3, so the moving average order is taken as 3. The PACF plot shows that the partial correlation drops off after lag 3, so the AR order is also taken as 3.
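The cut-off rule used above can be illustrated on simulated data: for an MA(3) process the sample ACF should drop to near zero after lag 3. A small base-R sketch (coefficients are arbitrary):

```r
# Illustrative only: simulate an MA(3) process and inspect how its sample
# ACF cuts off after lag 3, the rule of thumb used to pick the MA order.
set.seed(1)
sim <- arima.sim(model = list(ma = c(0.6, 0.4, 0.3)), n = 500)

a <- acf(sim, lag.max = 10, plot = FALSE)$acf
round(a[5:11], 2)   # sample ACF at lags 4..10, expected to be near zero
```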

babywipe_model<-arima(babywipe_daily_additive$random, order = c(3,0,3))
AIC(babywipe_model)
## [1] 5217.269
babywipe_model2<-arima(babywipe_daily_additive$random, order = c(2,0,3))
AIC(babywipe_model2)
## [1] 5194.018
babywipe_model3<-arima(babywipe_daily_additive$random, order = c(1,0,3))
AIC(babywipe_model3)
## [1] 5227.107

Model 2 is better in terms of AIC values, therefore it is chosen.
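The manual AIC comparison above can be automated. A sketch that fits the same candidate orders and keeps the lowest-AIC one, shown on simulated data since the project series is not reproduced here (in the report, `babywipe_daily_additive$random` would take the place of `x`):

```r
# Sketch: choose the ARIMA order with the lowest AIC among a few candidates.
set.seed(42)
x <- arima.sim(model = list(ar = 0.5, ma = c(0.3, 0.2)), n = 300)

orders <- list(c(3, 0, 3), c(2, 0, 3), c(1, 0, 3))
aics <- sapply(orders, function(ord) AIC(arima(x, order = ord)))
best <- orders[[which.min(aics)]]   # order with minimum AIC
</imports>
```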

Fitting the model

model_forecast_error <- predict(babywipe_model2, n.ahead = 11)$pred
last_trend_value <-tail(babywipe_daily_additive$trend[!is.na(babywipe_daily_additive$trend)],1)
seasonality=babywipe_daily_additive$seasonal[295:305]
babywipe_arima_forecast=model_forecast_error+seasonality+last_trend_value

babywipe_daily_te$forecast<-as.numeric(babywipe_arima_forecast)


ggplot(babywipe_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above shows the actual values and the ARIMA predictions, built on the decomposed random component, for the 11-day test window. Seasonal effects are clearly visible in both actual and fitted values. The forecasts are higher than the actuals because the last trend value is reused for all predictions; if a separate model were built to extrapolate the trend, a better result might be achieved.
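One way to avoid reusing the last trend value, as suggested, is to extrapolate the trend with a linear fit over its most recent portion. A sketch on a simulated series (the 4-week fitting window is an arbitrary choice; in the report `babywipe_daily_additive` would be used):

```r
# Sketch: extrapolate the decomposition trend with a linear fit instead of
# holding its last value constant. Shown on a simulated daily series.
set.seed(7)
y <- ts(100 + 0.5 * (1:140) + rnorm(140, sd = 5), frequency = 7)
dec <- decompose(y, type = "additive")

trend <- as.numeric(na.omit(dec$trend))   # drop NA ends of the MA filter
k <- 28                                   # fit on the last 4 weeks (arbitrary)
recent <- data.frame(t = seq_len(k), v = tail(trend, k))
fit <- lm(v ~ t, data = recent)

# Trend forecast for the next 11 days
trend_fc <- predict(fit, newdata = data.frame(t = k + 1:11))
```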

error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecast)
##    n mean       sd      bias      mape      mad    wmape        MPE
## 1 11  521 143.7651 -0.615196 0.6609404 320.5171 0.615196 -0.6609404

WMAPE is 0.62. The model can be improved by adding regressors, which should lower the WMAPE.

Regressor Search

babywipe_lm1<-lm(sold_count~visit_count+is.discount_days, data=babywipe_daily_tr)
summary(babywipe_lm1)
## 
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = babywipe_daily_tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1104.11  -144.55   -81.80    28.45  2936.01 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      270.29626   20.64823   13.09  < 2e-16 ***
## visit_count        0.03880    0.00457    8.49  4.4e-16 ***
## is.discount_days 984.69064   92.35038   10.66  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 350 on 389 degrees of freedom
## Multiple R-squared:  0.3374, Adjusted R-squared:  0.334 
## F-statistic: 99.06 on 2 and 389 DF,  p-value: < 2.2e-16
babywipe_lm2<-lm(sold_count~category_sold+price, data=babywipe_daily_tr)
summary(babywipe_lm2)
## 
## Call:
## lm(formula = sold_count ~ category_sold + price, data = babywipe_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -685.84  -76.99    8.36   80.37 1345.98 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   924.398434 194.061445   4.763 2.69e-06 ***
## category_sold   0.208919   0.006506  32.111  < 2e-16 ***
## price         -12.929247   2.651374  -4.876 1.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 185.5 on 389 degrees of freedom
## Multiple R-squared:  0.8139, Adjusted R-squared:  0.813 
## F-statistic: 850.8 on 2 and 389 DF,  p-value: < 2.2e-16
babywipe_lm3<-lm(sold_count~price+favored_count+basket_count+category_sold, data=babywipe_daily_tr)
summary(babywipe_lm3)
## 
## Call:
## lm(formula = sold_count ~ price + favored_count + basket_count + 
##     category_sold, data = babywipe_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -644.26  -41.18   10.37   48.97 1169.68 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    57.469328 124.209428   0.463    0.644    
## price          -1.220304   1.695079  -0.720    0.472    
## favored_count  -0.363498   0.023540 -15.442   <2e-16 ***
## basket_count    0.283905   0.011306  25.111   <2e-16 ***
## category_sold   0.113418   0.005918  19.163   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.1 on 387 degrees of freedom
## Multiple R-squared:   0.93,  Adjusted R-squared:  0.9293 
## F-statistic:  1286 on 4 and 387 DF,  p-value: < 2.2e-16
babywipe_lm4<-lm(sold_count~favored_count+basket_count+category_sold+is.discount_days, data=babywipe_daily_tr)
summary(babywipe_lm4)
## 
## Call:
## lm(formula = sold_count ~ favored_count + basket_count + category_sold + 
##     is.discount_days, data = babywipe_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -740.20  -39.02   11.37   47.56 1125.06 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -25.788167   8.588435  -3.003 0.002850 ** 
## favored_count     -0.351183   0.023274 -15.089  < 2e-16 ***
## basket_count       0.283635   0.010722  26.454  < 2e-16 ***
## category_sold      0.108112   0.005978  18.084  < 2e-16 ***
## is.discount_days 121.622410  33.420190   3.639 0.000311 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 112.2 on 387 degrees of freedom
## Multiple R-squared:  0.9322, Adjusted R-squared:  0.9315 
## F-statistic:  1331 on 4 and 387 DF,  p-value: < 2.2e-16
babywipe_daily_te$forecastlm1<-predict(babywipe_lm1,babywipe_daily_te)
babywipe_daily_te$forecastlm2<-predict(babywipe_lm2,babywipe_daily_te)
babywipe_daily_te$forecastlm3<-predict(babywipe_lm3,babywipe_daily_te)
babywipe_daily_te$forecastlm4<-predict(babywipe_lm4,babywipe_daily_te)

error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecastlm1)
##    n mean       sd         bias      mape      mad     wmape        MPE
## 1 11  521 143.7651 -0.007162319 0.1323333 62.80507 0.1205472 -0.0402405
error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecastlm2)
##    n mean       sd       bias      mape      mad     wmape        MPE
## 1 11  521 143.7651 -0.4633883 0.4653101 241.4253 0.4633883 -0.4653101
error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecastlm3)
##    n mean       sd       bias      mape      mad     wmape        MPE
## 1 11  521 143.7651 -0.2103372 0.2161906 109.5857 0.2103372 -0.2161906
error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecastlm4)
##    n mean       sd       bias      mape      mad     wmape        MPE
## 1 11  521 143.7651 -0.1886224 0.1947759 98.27226 0.1886224 -0.1947759

Four different multiple linear regression models are built to understand which regressors explain sales best. In terms of adjusted R-square, the 4th model explains sales behavior better than the others. When WMAPE is checked, the 1st model has the minimum value, although the 1st and 4th models have close WMAPE values. Therefore the 4th model’s regressors are chosen: favored count, basket count, category sold and the discount-day indicator.

ARIMAX Model

babywipe_xreg1<-cbind(babywipe_daily_tr$favored_count,
                  babywipe_daily_tr$basket_count,
                  babywipe_daily_tr$category_sold,
                  babywipe_daily_tr$is.discount_days)
babywipe_xreg2<-cbind(babywipe_daily_te$favored_count,
                  babywipe_daily_te$basket_count,
                  babywipe_daily_te$category_sold,
                  babywipe_daily_te$is.discount_days)

babywipe_arimax<-Arima(babywipe_ts_daily,xreg=as.matrix(babywipe_xreg1),order=c(2,0,3))
babywipe_daily_te$forecastarimax<-forecast(babywipe_arimax,xreg=as.matrix(babywipe_xreg2))$mean
error_test(babywipe_daily_te$sold_count,babywipe_daily_te$forecastarimax)
##    n mean       sd       bias     mape      mad    wmape        MPE
## 1 11  521 143.7651 -0.1168833 0.158371 85.02354 0.163193 -0.1031291

WMAPE is 0.16, which is desirable for forecasting. The plot below shows the ARIMAX forecasts against actual sales.

ggplot(babywipe_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

As can be seen from the plot, actual and forecasted values are close to each other and WMAPE is low, as desired. Wet wipe sales are explained well by favored count, basket count, category sold and discount days.

Altınyıldız Classics Coat

Decomposition

mont<-subset(alldata, alldata$product_content_id==48740784)
#mont_xts<-xts(mont,order.by=mont$event_date)
mont_daily_tr<-mont[mont$event_date<"2021-06-21"]
mont_daily_te<-mont[mont$event_date>="2021-06-21"]
plot(mont$sold_count, type = 'l')

The plot above shows sold counts of the Altınyıldız Classics coat. The coat was available only in winter and only on certain days.

mont_ts_daily<-ts(mont_daily_tr$sold_count, freq=7)
tsdisplay(mont_ts_daily)

Necessary data manipulations are made to construct a time series object with frequency 7. We see autocorrelation at lag 7, suggesting seasonality at the weekly level. It will be examined further after decomposition.

mont_daily_additive<-decompose(mont_ts_daily, type ="additive")
mont_daily_multiplicative<-decompose(mont_ts_daily, type ="multiplicative")
plot(mont_daily_additive)

plot(mont_daily_multiplicative)

Since the coat is not sold in certain months, the constant variance assumption is violated in both cases.

To decide which decomposition is more appropriate for the rest of the analysis, a stationarity test is applied to the random part of each.

test1=ur.kpss(mont_daily_additive$random, use.lag="7")
summary(test1)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0104 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
test2=ur.kpss(mont_daily_multiplicative$random, use.lag="7")
summary(test2)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.1086 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic of the additive decomposition is smaller, therefore it is chosen for further examination.

plot(mont_daily_additive$seasonal[1:30], type='l')

plot(mont_daily_additive$trend, type='l')

Plots above show closer looks at the trend and seasonal components of the additive decomposition. We can clearly see seasonality at the daily level.

Since there are not enough data points, decomposition cannot be made at other levels. Although we cannot decompose, weekly and monthly plots can be seen below.

Weekly

mont_xts<-xts(mont_daily_tr$sold_count, order.by=as.Date(mont_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(mont_xts,on="weeks")
mont_xts_weekly<-period.apply(mont_xts,INDEX=weekly,FUN=mean)
plot(mont_xts_weekly)

Monthly

mont_xts<-xts(mont_daily_tr$sold_count, order.by=as.Date(mont_daily_tr$event_date,"%d.%m.%Y"))
monthly<-endpoints(mont_xts,on="months")
mont_xts_monthly<-period.apply(mont_xts,INDEX=monthly,FUN=mean)
plot(mont_xts_monthly)

ARIMA Model

tsdisplay(mont_daily_additive$random, na.action = na.omit)

mont_model<-arima(mont_daily_additive$random, order = c(3,0,4))
summary(mont_model)
## 
## Call:
## arima(x = mont_daily_additive$random, order = c(3, 0, 4))
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2     ma3     ma4  intercept
##       0.2799  0.3529  -0.4270  -0.8088  -0.8964  0.5796  0.1256      0e+00
## s.e.  0.1689  0.1657   0.0903   0.1700   0.2437  0.0797  0.0797      4e-04
## 
## sigma^2 estimated as 3.751:  log likelihood = -807.21,  aic = 1632.41
## 
## Training set error measures:
##                      ME     RMSE       MAE      MPE     MAPE      MASE
## Training set 0.01248532 1.936705 0.7504695 86.26919 199.7648 0.6242751
##                     ACF1
## Training set 0.007485642

The random component of the additive model is investigated in terms of autocorrelation and partial autocorrelation to find appropriate p, d, q values. Based on the ACF and PACF plots, an initial model with AR order 3 and MA order 4 is fitted.

mont_model2<-arima(mont_daily_additive$random, order = c(2,0,4))
AIC(mont_model2)
## [1] 1628.922

Model 2 is better in terms of AIC values, therefore it is chosen to perform the forecast.

Fitting the model

model_forecast_error <- predict(mont_model2, n.ahead = 11)$pred
last_trend_value <-tail(mont_daily_additive$trend[!is.na(mont_daily_additive$trend)],1)
seasonality=mont_daily_additive$seasonal[295:305]
mont_arima_forecast=model_forecast_error+seasonality+last_trend_value

mont_daily_te$forecast<-as.numeric(mont_arima_forecast)


ggplot(mont_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

error_test(mont_daily_te$sold_count,mont_daily_te$forecast)
##    n     mean       sd      bias mape      mad     wmape  MPE
## 1 11 1.545455 1.128152 -0.730025  Inf 1.397975 0.9045719 -Inf

In this case WMAPE turned out to be very high, because on some days the product was not available and ARIMA could not predict whether the product would be sold at all.
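Intermittent series like this one are often handled with dedicated methods such as Croston’s, which exponentially smooths the non-zero demand sizes and the intervals between them separately and forecasts their ratio. A minimal base-R sketch is given below; the smoothing constant is an arbitrary choice, and `forecast::croston` provides a ready-made implementation.

```r
# Minimal sketch of Croston's method for intermittent demand.
# alpha is arbitrary; forecast::croston offers a tested implementation.
croston_sketch <- function(x, alpha = 0.1) {
  idx <- which(x > 0)
  if (length(idx) < 2) return(mean(x))
  sizes <- x[idx]                       # non-zero demand sizes
  gaps  <- diff(c(0, idx))              # intervals between demands
  z <- sizes[1]; p <- gaps[1]
  for (i in 2:length(idx)) {
    z <- z + alpha * (sizes[i] - z)     # smoothed demand size
    p <- p + alpha * (gaps[i]  - p)     # smoothed inter-demand interval
  }
  z / p                                 # forecast demand rate per period
}

demand <- c(0, 0, 3, 0, 0, 0, 2, 0, 4, 0, 0, 2)
rate <- croston_sketch(demand)
```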

Regressor Search

We constructed three different linear models and compared their WMAPE values to decide on the regressors to add to the ARIMA model.

mont_lm1<-lm(sold_count~visit_count+is.discount_days, data=mont_daily_tr)
summary(mont_lm1)
## 
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = mont_daily_tr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.329 -0.671 -0.663 -0.663 51.337 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.663459   0.189687   3.498 0.000524 ***
## visit_count      0.007301   0.002587   2.822 0.005016 ** 
## is.discount_days 1.592986   0.910109   1.750 0.080851 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.448 on 389 degrees of freedom
## Multiple R-squared:  0.02602,    Adjusted R-squared:  0.02102 
## F-statistic: 5.197 on 2 and 389 DF,  p-value: 0.005924
mont_lm2<-lm(sold_count~category_sold+price, data=mont_daily_tr)
summary(mont_lm2)
## 
## Call:
## lm(formula = sold_count ~ category_sold + price, data = mont_daily_tr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.355 -2.910 -0.698  0.334 47.381 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.5755049  1.3268375   0.434  0.66545   
## category_sold 0.0001318  0.0008419   0.157  0.87594   
## price         0.0056698  0.0021337   2.657  0.00923 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.07 on 96 degrees of freedom
##   (293 observations deleted due to missingness)
## Multiple R-squared:  0.06883,    Adjusted R-squared:  0.04943 
## F-statistic: 3.548 on 2 and 96 DF,  p-value: 0.03261
mont_lm3<-lm(sold_count~favored_count+basket_count, data=mont_daily_tr)
summary(mont_lm3)
## 
## Call:
## lm(formula = sold_count ~ favored_count + basket_count, data = mont_daily_tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.6804   0.0378   0.1035   0.1035   9.8401 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.103516   0.085889  -1.205    0.229    
## favored_count  0.003277   0.014855   0.221    0.826    
## basket_count   0.171107   0.004247  40.291   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.527 on 389 degrees of freedom
## Multiple R-squared:  0.809,  Adjusted R-squared:  0.808 
## F-statistic: 823.6 on 2 and 389 DF,  p-value: < 2.2e-16
mont_daily_te$forecastlm1<-predict(mont_lm1,mont_daily_te)
mont_daily_te$forecastlm2<-predict(mont_lm2,mont_daily_te)
mont_daily_te$forecastlm3<-predict(mont_lm3,mont_daily_te)

error_test(mont_daily_te$sold_count,mont_daily_te$forecastlm1)
##    n     mean       sd       bias mape       mad     wmape  MPE
## 1 11 1.545455 1.128152 0.07639808  Inf 0.8102629 0.5242877 -Inf
error_test(mont_daily_te$sold_count,mont_daily_te$forecastlm2)
##    n     mean       sd       bias mape     mad    wmape  MPE
## 1 11 1.545455 1.128152 -0.8498868  Inf 1.55831 1.008318 -Inf
error_test(mont_daily_te$sold_count,mont_daily_te$forecastlm3)
##    n     mean       sd      bias mape       mad     wmape  MPE
## 1 11 1.545455 1.128152 0.5395895  Inf 0.9814649 0.6350655 -Inf

Model 1 performed best with a WMAPE of about 52%. Although Model 3 has a much higher adjusted R-square, Model 1 gives the lower WMAPE, therefore the discount-day indicator and visit counts will be used as regressors.

ARIMAX Model

mont_xreg1<-cbind(mont_daily_tr$is.discount_days,
                  mont_daily_tr$visit_count)
mont_xreg2<-cbind(mont_daily_te$is.discount_days,
                  mont_daily_te$visit_count)

mont_arimax<-Arima(mont_ts_daily,xreg=as.matrix(mont_xreg1),order=c(2,0,4))
mont_daily_te$forecastarimax<-forecast(mont_arimax,xreg=as.matrix(mont_xreg2))$mean
error_test(mont_daily_te$sold_count,mont_daily_te$forecastarimax)
##    n     mean       sd     bias mape       mad    wmape  MPE
## 1 11 1.545455 1.128152 0.160217  Inf 0.8518484 0.551196 -Inf

With the addition of regressors, the WMAPE decreased from 90% to 55%.

ggplot(mont_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

Above plot shows the forecasted and actual values with the ARIMAX model. Model significantly improved with the addition of regressors.

Oral-B Toothbrush

Decomposition

oralb<-subset(alldata, alldata$product_content_id==32939029)
#oralb_xts<-xts(oralb,order.by=oralb$event_date)
oralb_daily_tr<-oralb[oralb$event_date<"2021-06-21"]
oralb_daily_te<-oralb[oralb$event_date>="2021-06-21"]
plot(oralb$sold_count, type='l')

The plot above shows sold counts of the Oral-B toothbrush. Both the variance and the sales level seem to increase over time.

oralb_ts_daily<-ts(oralb_daily_tr$sold_count, freq=7)
tsdisplay(oralb_ts_daily)

Necessary data manipulations are made to obtain daily data. The ACF plot shows high correlation at many lags, while the PACF shows a significant correlation only at lag 1. It will be examined further after decomposition.

oralb_daily_additive<-decompose(oralb_ts_daily, type ="additive")
oralb_daily_multiplicative<-decompose(oralb_ts_daily, type ="multiplicative")
plot(oralb_daily_additive)

plot(oralb_daily_multiplicative)

For the additive decomposition the random part has increasing variance, which violates the constant variance assumption. For the multiplicative decomposition the random part has decreasing variance, which violates it as well.

To decide which decomposition is more appropriate for the rest of the analysis, a stationarity test is applied to the random part of each.

test1=ur.kpss(oralb_daily_additive$random, use.lag="7")
summary(test1)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0132 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
test2=ur.kpss(oralb_daily_multiplicative$random, use.lag="7")
summary(test2)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.3125 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The test statistic of the additive decomposition is smaller, therefore it is chosen for further examination.

plot(oralb_daily_additive$seasonal[1:30], type='l')

plot(oralb_daily_additive$trend, type='l')

Plots above show closer looks at the trend and seasonal components of the additive decomposition. We can clearly see seasonality at the daily level.

Since there are not enough data points, decomposition cannot be made at other levels. Although we cannot decompose, weekly and monthly plots can be seen below.

Weekly

oralb_xts<-xts(oralb_daily_tr$sold_count, order.by=as.Date(oralb_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(oralb_xts,on="weeks")
oralb_xts_weekly<-period.apply(oralb_xts,INDEX=weekly,FUN=mean)
plot(oralb_xts_weekly)

Monthly

oralb_xts<-xts(oralb_daily_tr$sold_count, order.by=as.Date(oralb_daily_tr$event_date,"%d.%m.%Y"))
monthly<-endpoints(oralb_xts,on="months")
oralb_xts_monthly<-period.apply(oralb_xts,INDEX=monthly,FUN=mean)
plot(oralb_xts_monthly)

The weekly and monthly plots show that demand for this product increased over time. Even so, there are still outlier weeks and months that cannot be explained by sold count alone; price or promotions may drive such points.
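Such outlier points can be dampened before modelling. One simple approach, sketched below in base R on simulated data, replaces observations that deviate strongly from a 7-day running median; the 3x threshold is an arbitrary choice, and `forecast::tsclean` offers a more principled alternative.

```r
# Sketch: replace observations that sit far from a running median.
# The 3x threshold is arbitrary; forecast::tsclean is a more principled option.
cap_outliers <- function(x, k = 7, mult = 3) {
  med <- runmed(x, k)                  # running median, window k
  dev <- abs(x - med)
  cut <- mult * median(dev[dev > 0])   # robust deviation threshold
  ifelse(dev > cut, med, x)            # swap extreme points for the median
}

set.seed(3)
x <- rpois(60, 20)
x[c(10, 40)] <- c(200, 150)            # injected spikes
cleaned <- cap_outliers(x)
```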

ARIMA Model

To decide on the parameters of ARIMA model, autocorrelation and partial autocorrelation of the random component of additive model are investigated.

tsdisplay(oralb_daily_additive$random, na.action = na.omit)

oralb_model<-arima(oralb_daily_additive$random, order = c(3,0,3))
summary(oralb_model)
## 
## Call:
## arima(x = oralb_daily_additive$random, order = c(3, 0, 3))
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2      ma3  intercept
##       0.2399  0.0993  -0.2964  -0.2553  -0.6100  -0.1347    -0.0024
## s.e.  0.2453  0.2464   0.1538   0.2488   0.2466   0.1419     0.0251
## 
## sigma^2 estimated as 787.9:  log likelihood = -1837.79,  aic = 3691.58
## 
## Training set error measures:
##                      ME     RMSE      MAE      MPE     MAPE      MASE
## Training set -0.4667117 28.07005 19.36278 49.37787 226.8666 0.6808855
##                      ACF1
## Training set -0.005579994

As can be seen from the graphs, there is significant negative autocorrelation at lag 3 in the ACF, and the partial autocorrelation decreases significantly after lag 3 in the PACF; therefore the order is selected as (3,0,3).

oralb_model2<-arima(oralb_daily_additive$random, order = c(2,0,3))
AIC(oralb_model2)
## [1] 3690.505

Model 2 is better in terms of AIC, therefore it is chosen to construct the ARIMA model.

Fitting the model

model_forecast_error <- predict(oralb_model2, n.ahead = 11)$pred
last_trend_value <-tail(oralb_daily_additive$trend[!is.na(oralb_daily_additive$trend)],1)
seasonality=oralb_daily_additive$seasonal[295:305]
oralb_arima_forecast=model_forecast_error+seasonality+last_trend_value

oralb_daily_te$forecast<-as.numeric(oralb_arima_forecast)


ggplot(oralb_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above shows that we are overpredicting, mainly due to the constant trend assumption; the addition of regressors may reduce this bias.

error_test(oralb_daily_te$sold_count,oralb_daily_te$forecast)
##    n     mean       sd      bias     mape      mad    wmape       MPE
## 1 11 80.27273 41.46587 -1.153151 1.661979 92.56657 1.153151 -1.661979

The WMAPE is very high (1.15) for the reasons mentioned above.

Regressor Search

To decide on which regressors to use, 3 different linear models are constructed and compared in terms of their error rates.

oralb_lm1<-lm(sold_count~visit_count+is.discount_days, data=oralb_daily_tr)
summary(oralb_lm1)
## 
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = oralb_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -124.33  -29.55  -16.55   15.01  235.86 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.055e+01  3.156e+00  12.849  < 2e-16 ***
## visit_count      2.343e-02  7.527e-04  31.134  < 2e-16 ***
## is.discount_days 4.858e+01  1.371e+01   3.543 0.000443 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52.01 on 389 degrees of freedom
## Multiple R-squared:  0.7192, Adjusted R-squared:  0.7178 
## F-statistic: 498.2 on 2 and 389 DF,  p-value: < 2.2e-16
oralb_lm2<-lm(sold_count~category_sold+price, data=oralb_daily_tr)
summary(oralb_lm2)
## 
## Call:
## lm(formula = sold_count ~ category_sold + price, data = oralb_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -242.52  -56.85  -26.14   36.76  380.15 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.950e+02  6.511e+01  -4.531 7.86e-06 ***
## category_sold  5.331e-02  5.824e-03   9.154  < 2e-16 ***
## price          2.501e+00  4.721e-01   5.297 2.00e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 87.45 on 380 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.2068, Adjusted R-squared:  0.2026 
## F-statistic: 49.54 on 2 and 380 DF,  p-value: < 2.2e-16
oralb_lm3<-lm(sold_count~favored_count+basket_count, data=oralb_daily_tr)
summary(oralb_lm3)
## 
## Call:
## lm(formula = sold_count ~ favored_count + basket_count, data = oralb_daily_tr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -182.608  -10.381   -3.624    7.994  115.926 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.070275   2.068327   1.484    0.139    
## favored_count -0.036779   0.006199  -5.933 6.59e-09 ***
## basket_count   0.261759   0.007315  35.786  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.96 on 389 degrees of freedom
## Multiple R-squared:  0.9129, Adjusted R-squared:  0.9125 
## F-statistic:  2039 on 2 and 389 DF,  p-value: < 2.2e-16
oralb_daily_te$forecastlm1<-predict(oralb_lm1,oralb_daily_te)
oralb_daily_te$forecastlm2<-predict(oralb_lm2,oralb_daily_te)
oralb_daily_te$forecastlm3<-predict(oralb_lm3,oralb_daily_te)

error_test(oralb_daily_te$sold_count,oralb_daily_te$forecastlm1)
##    n     mean       sd       bias      mape      mad     wmape        MPE
## 1 11 80.27273 41.46587 -0.2773368 0.5155861 26.13322 0.3255554 -0.4866785
error_test(oralb_daily_te$sold_count,oralb_daily_te$forecastlm2)
##    n     mean       sd       bias      mape      mad     wmape        MPE
## 1 11 80.27273 41.46587 -0.2715985 0.6714054 36.86528 0.4592504 -0.5582226
error_test(oralb_daily_te$sold_count,oralb_daily_te$forecastlm3)
##    n     mean       sd      bias      mape      mad     wmape      MPE
## 1 11 80.27273 41.46587 0.1447854 0.1452873 13.09143 0.1630869 0.117745

Model 3 yielded a WMAPE of about 16%, much lower than that of the ARIMA model. Favored count and basket count will therefore be added as regressors.

ARIMAX Model

oralb_xreg1<-cbind(oralb_daily_tr$favored_count,
                  oralb_daily_tr$basket_count)
oralb_xreg2<-cbind(oralb_daily_te$favored_count,
                  oralb_daily_te$basket_count)

oralb_arimax<-Arima(oralb_ts_daily,xreg=as.matrix(oralb_xreg1),order=c(3,0,3))
oralb_daily_te$forecastarimax<-forecast(oralb_arimax,xreg=as.matrix(oralb_xreg2))$mean
error_test(oralb_daily_te$sold_count,oralb_daily_te$forecastarimax)
##    n     mean       sd       bias      mape      mad     wmape           MPE
## 1 11 80.27273 41.46587 0.06866556 0.1598583 12.86963 0.1603238 -0.0009525859

With the addition of regressors the WMAPE fell to about 16%, only marginally better than the linear model alone; the ARIMA structure itself adds little predictive power here.

ggplot(oralb_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above shows the forecasts of the final model, which track the actual values well.

Trendyolmilla Bikini 1

Decomposition

bikini1<-subset(alldata, alldata$product_content_id==73318567)
bikini1_daily_tr<-bikini1[bikini1$event_date<"2021-06-21"]
bikini1_daily_te<-bikini1[bikini1$event_date>="2021-06-21"]
plot(bikini1$sold_count, type = 'l')

The plot above shows the sold count for bikini1. Sales volumes increased in summer, and there was also a spike in sales in March.

bikini1_ts_daily<-ts(bikini1_daily_tr$sold_count, freq=7)
tsdisplay(bikini1_ts_daily)

The autocorrelations decay gradually, and the PACF shows a significant correlation only at lag 1.

bikini1_daily_additive<-decompose(bikini1_ts_daily, type ="additive")
bikini1_daily_multiplicative<-decompose(bikini1_ts_daily, type ="multiplicative")
plot(bikini1_daily_additive)

plot(bikini1_daily_multiplicative)

The constant-variance assumption is violated in both cases. To decide which decomposition is better, stationarity tests are applied to the random components.

test1=ur.kpss(bikini1_daily_additive$random, use.lag="7")
summary(test1)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0259 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
test2=ur.kpss(bikini1_daily_multiplicative$random, use.lag="7")
summary(test2)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.1168 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The additive decomposition gives the smaller test statistic, so it is chosen for further examination.

plot(bikini1_daily_additive$seasonal[1:30], type='l')

plot(bikini1_daily_additive$trend, type='l')

The plots above give a closer look at the seasonal and trend components of the additive decomposition; day-of-week seasonality is clearly visible. Since there are not enough data points, decomposition cannot be performed at weekly or monthly levels, but aggregated weekly and monthly plots can be seen below.

Weekly

bikini1_xts<-xts(bikini1_daily_tr$sold_count, order.by=as.Date(bikini1_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(bikini1_xts,on="weeks")
bikini1_xts_weekly<-period.apply(bikini1_xts,INDEX=weekly,FUN=mean)
plot(bikini1_xts_weekly)

Monthly

bikini1_xts<-xts(bikini1_daily_tr$sold_count, order.by=as.Date(bikini1_daily_tr$event_date,"%d.%m.%Y"))
monthly<-endpoints(bikini1_xts,on="months")
bikini1_xts_monthly<-period.apply(bikini1_xts,INDEX=monthly,FUN=mean)
plot(bikini1_xts_monthly)

ARIMA Model

To decide on the parameters of ARIMA model, autocorrelation and partial autocorrelation of the random component of additive model are investigated.

tsdisplay(bikini1_daily_additive$random, na.action = na.omit)

bikini1_model<-arima(bikini1_daily_additive$random, order = c(5,0,5))
summary(bikini1_model)
## 
## Call:
## arima(x = bikini1_daily_additive$random, order = c(5, 0, 5))
## 
## Coefficients:
##          ar1      ar2      ar3     ar4      ar5      ma1     ma2     ma3
##       0.1762  -0.3644  -0.4165  0.5881  -0.3656  -0.2840  0.0761  0.1389
## s.e.  0.1465   0.0546   0.0638  0.0727   0.0997   0.1591  0.0623  0.0520
##           ma4     ma5  intercept
##       -0.9411  0.0101     0.0003
## s.e.   0.0551  0.1546     0.0099
## 
## sigma^2 estimated as 77.51:  log likelihood = -1392.75,  aic = 2809.5
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -0.04492755 8.803879 3.927318 -37.79243 242.1186 0.6478351
##                       ACF1
## Training set -0.0001361801

As can be seen from the plots, there is significant negative autocorrelation at lag 5 in the ACF, and the partial autocorrelations drop off after lag 5 in the PACF; the order is therefore selected as (5,0,5).

bikini1_model2<-arima(bikini1_daily_additive$random, order = c(5,0,4))
AIC(bikini1_model2)
## [1] 2807.508

Model 2, with order (5,0,4), has the lower AIC and is therefore chosen for the ARIMA forecast.

Fitting the model

model_forecast_error <- predict(bikini1_model2, n.ahead = 11)$pred
last_trend_value <-tail(bikini1_daily_additive$trend[!is.na(bikini1_daily_additive$trend)],1)
seasonality=bikini1_daily_additive$seasonal[295:305]
bikini1_arima_forecast=model_forecast_error+seasonality+last_trend_value

bikini1_daily_te$forecast<-as.numeric(bikini1_arima_forecast)
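Note that the seasonal indices 295:305 are hardcoded for this particular 11-step horizon. A sketch of a more robust alternative (assuming the objects built above) cycles through the length-7 weekly seasonal pattern from the end of the training series:

```r
# Sketch: extend the weekly seasonal pattern for an h-step horizon
# instead of hardcoding indices 295:305; names from the chunks above.
h <- 11
pattern <- bikini1_daily_additive$seasonal[1:7]   # one full weekly cycle
n_train <- length(bikini1_daily_tr$sold_count)
# Seasonal value for future observation t = n_train + k is
# pattern[((t - 1) %% 7) + 1], for k = 1..h.
future_seasonal <- pattern[((n_train + seq_len(h) - 1) %% 7) + 1]
bikini1_arima_forecast <- model_forecast_error + future_seasonal + last_trend_value
```

This keeps the forecast reconstruction correct even if the training window length changes.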


ggplot(bikini1_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above shows that we are overpredicting. This is mainly due to the constant-trend assumption; adding regressors may reduce this bias.

error_test(bikini1_daily_te$sold_count,bikini1_daily_te$forecast)
##    n     mean       sd       bias    mape      mad     wmape       MPE
## 1 11 26.36364 10.46205 -0.8187966 1.07578 21.83745 0.8283172 -1.070439

The WMAPE is very high in this case, for the reasons mentioned above.

Regressor Search

To decide on which regressors to use, 3 different linear models are constructed and compared in terms of their error rates.

bikini1_lm1<-lm(sold_count~price+is.discount_days, data=bikini1_daily_tr)
summary(bikini1_lm1)
## 
## Call:
## lm(formula = sold_count ~ price + is.discount_days, data = bikini1_daily_tr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -66.84 -62.22 -17.22  44.28 219.16 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      -196.957    842.376  -0.234   0.8156  
## price               4.387     14.008   0.313   0.7547  
## is.discount_days  -58.824     32.447  -1.813   0.0726 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.76 on 108 degrees of freedom
##   (281 observations deleted due to missingness)
## Multiple R-squared:  0.03112,    Adjusted R-squared:  0.01318 
## F-statistic: 1.735 on 2 and 108 DF,  p-value: 0.1814
bikini1_lm2<-lm(sold_count~category_sold+price, data=bikini1_daily_tr)
summary(bikini1_lm2)
## 
## Call:
## lm(formula = sold_count ~ category_sold + price, data = bikini1_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -123.98  -28.98  -12.96   32.01  149.58 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.066e+03  7.208e+02   2.867  0.00499 ** 
## category_sold  2.333e-02  2.778e-03   8.399 1.95e-13 ***
## price         -3.425e+01  1.203e+01  -2.847  0.00529 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.87 on 108 degrees of freedom
##   (281 observations deleted due to missingness)
## Multiple R-squared:  0.3961, Adjusted R-squared:  0.3849 
## F-statistic: 35.42 on 2 and 108 DF,  p-value: 1.487e-12
bikini1_lm3<-lm(sold_count~favored_count+basket_count, data=bikini1_daily_tr)
summary(bikini1_lm3)
## 
## Call:
## lm(formula = sold_count ~ favored_count + basket_count, data = bikini1_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.828  -0.148  -0.148  -0.148  34.767 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.148174   0.435072   0.341    0.734    
## favored_count -0.015716   0.001207 -13.017   <2e-16 ***
## basket_count   0.247511   0.003448  71.790   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.003 on 389 degrees of freedom
## Multiple R-squared:  0.9719, Adjusted R-squared:  0.9717 
## F-statistic:  6721 on 2 and 389 DF,  p-value: < 2.2e-16
bikini1_daily_te$forecastlm1<-predict(bikini1_lm1,bikini1_daily_te)
bikini1_daily_te$forecastlm2<-predict(bikini1_lm2,bikini1_daily_te)
bikini1_daily_te$forecastlm3<-predict(bikini1_lm3,bikini1_daily_te)

error_test(bikini1_daily_te$sold_count,bikini1_daily_te$forecastlm1)
##    n     mean       sd      bias     mape      mad    wmape       MPE
## 1 11 26.36364 10.46205 -1.511946 1.846804 39.86038 1.511946 -1.846804
error_test(bikini1_daily_te$sold_count,bikini1_daily_te$forecastlm2)
##    n     mean       sd      bias     mape      mad    wmape       MPE
## 1 11 26.36364 10.46205 -4.673931 5.616296 123.2218 4.673931 -5.616296
error_test(bikini1_daily_te$sold_count,bikini1_daily_te$forecastlm3)
##    n     mean       sd       bias      mape      mad     wmape        MPE
## 1 11 26.36364 10.46205 0.04327184 0.2039842 5.068131 0.1922395 0.02200445

Model 3 yielded a WMAPE of about 19%, much lower than that of the ARIMA model. Favored count and basket count will therefore be added as regressors.

ARIMAX Model

bikini1_xreg1<-cbind(bikini1_daily_tr$favored_count,
                  bikini1_daily_tr$basket_count)
bikini1_xreg2<-cbind(bikini1_daily_te$favored_count,
                  bikini1_daily_te$basket_count)

bikini1_arimax<-Arima(bikini1_ts_daily,xreg=as.matrix(bikini1_xreg1),order=c(3,0,3))
bikini1_daily_te$forecastarimax<-forecast(bikini1_arimax,xreg=as.matrix(bikini1_xreg2))$mean
error_test(bikini1_daily_te$sold_count,bikini1_daily_te$forecastarimax)
##    n     mean       sd       bias      mape      mad     wmape         MPE
## 1 11 26.36364 10.46205 0.02295718 0.2159454 5.033602 0.1909297 0.005587715

With the addition of regressors the WMAPE decreased slightly, to about 19%.

ggplot(bikini1_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above shows the forecasts of the final model, which track the actual values well.

Trendyolmilla Bikini 2

Decomposition

bikini2<-subset(alldata, alldata$product_content_id==32737302)
#bikini2_xts<-xts(bikini2,order.by=bikini2$event_date)
bikini2_daily_tr<-bikini2[bikini2$event_date<"2021-06-21"]
bikini2_daily_te<-bikini2[bikini2$event_date>="2021-06-21"]
plot(bikini2$sold_count, type='l')

The plot above shows the sold counts for bikini2. Sales volumes show an increasing trend after January.

bikini2_ts_daily<-ts(bikini2_daily_tr$sold_count, freq=7)
tsdisplay(bikini2_ts_daily)

The autocorrelation plot indicates a trend, and the PACF shows a significant correlation only at lag 1.

bikini2_daily_additive<-decompose(bikini2_ts_daily, type ="additive")
bikini2_daily_multiplicative<-decompose(bikini2_ts_daily, type ="multiplicative")
plot(bikini2_daily_additive)

plot(bikini2_daily_multiplicative)

The variance looks more constant for the additive decomposition than for the multiplicative one, which shows spikes on some days. Stationarity tests on the random components are used to decide which decomposition to carry forward.

test1=ur.kpss(bikini2_daily_additive$random, use.lag="7")
summary(test1)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0667 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
test2=ur.kpss(bikini2_daily_multiplicative$random, use.lag="7")
summary(test2)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.122 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The additive decomposition gives the smaller test statistic, so it is chosen for further examination.

plot(bikini2_daily_additive$seasonal[1:30], type='l')

plot(bikini2_daily_additive$trend, type='l')

The plots above give a closer look at the seasonal and trend components of the additive decomposition; day-of-week seasonality is clearly visible. Since there are not enough data points, decomposition cannot be performed at weekly or monthly levels, but aggregated weekly and monthly plots can be seen below.

Weekly

bikini2_xts<-xts(bikini2_daily_tr$sold_count, order.by=as.Date(bikini2_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(bikini2_xts,on="weeks")
bikini2_xts_weekly<-period.apply(bikini2_xts,INDEX=weekly,FUN=mean)
plot(bikini2_xts_weekly)

Monthly

bikini2_xts<-xts(bikini2_daily_tr$sold_count, order.by=as.Date(bikini2_daily_tr$event_date,"%d.%m.%Y"))
monthly<-endpoints(bikini2_xts,on="months")
bikini2_xts_monthly<-period.apply(bikini2_xts,INDEX=monthly,FUN=mean)
plot(bikini2_xts_monthly)

The post-January trend is visible in both the weekly and monthly aggregated data.

ARIMA Model

To decide on the parameters of ARIMA model, autocorrelation and partial autocorrelation of the random component of additive model are investigated.

tsdisplay(bikini2_daily_additive$random, na.action = na.omit)

bikini2_model<-arima(bikini2_daily_additive$random, order = c(3,0,3))
summary(bikini2_model)
## 
## Call:
## arima(x = bikini2_daily_additive$random, order = c(3, 0, 3))
## 
## Coefficients:
##           ar1      ar2      ar3      ma1      ma2      ma3  intercept
##       -0.0022  -0.1073  -0.0801  -0.0176  -0.4037  -0.5787    -0.0061
## s.e.   0.1235   0.0785   0.0811   0.1174   0.0618   0.1070     0.0046
## 
## sigma^2 estimated as 22.61:  log likelihood = -1152.51,  aic = 2321.03
## 
## Training set error measures:
##                       ME     RMSE      MAE      MPE     MAPE      MASE
## Training set -0.05595952 4.755316 2.640292 74.72519 144.1095 0.6245488
##                     ACF1
## Training set -0.01022994

As can be seen from the plots, there is significant negative autocorrelation at lag 3 in the ACF, and the partial autocorrelations drop off after lag 3 in the PACF; the order is therefore selected as (3,0,3).

bikini2_model2<-arima(bikini2_daily_additive$random, order = c(2,0,3))
AIC(bikini2_model2)
## [1] 2320.045

Model 2, with order (2,0,3), has the lower AIC and is therefore chosen for the ARIMA forecast.

Fitting the model

model_forecast_error <- predict(bikini2_model2, n.ahead = 11)$pred
last_trend_value <-tail(bikini2_daily_additive$trend[!is.na(bikini2_daily_additive$trend)],1)
seasonality=bikini2_daily_additive$seasonal[295:305]
bikini2_arima_forecast=model_forecast_error+seasonality+last_trend_value

bikini2_daily_te$forecast<-as.numeric(bikini2_arima_forecast)


ggplot(bikini2_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above shows that we are overpredicting. This is mainly due to the constant-trend assumption; adding regressors may reduce this bias.

error_test(bikini2_daily_te$sold_count,bikini2_daily_te$forecast)
##    n mean      sd      bias      mape      mad    wmape        MPE
## 1 11   55 16.2911 -0.506666 0.6193516 28.61947 0.520354 -0.6106982

The WMAPE is very high in this case, for the reasons mentioned above.

Regressor Search

To decide on which regressors to use, 3 different linear models are constructed and compared in terms of their error rates.

bikini2_lm1<-lm(sold_count~price+is.discount_days, data=bikini2_daily_tr)
summary(bikini2_lm1)
## 
## Call:
## lm(formula = sold_count ~ price + is.discount_days, data = bikini2_daily_tr)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42.81 -14.97  -0.50  10.09  72.68 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -51.198     62.497  -0.819    0.414
## price               1.370      1.039   1.318    0.189
## is.discount_days   12.834      9.283   1.383    0.169
## 
## Residual standard error: 20.53 on 167 degrees of freedom
##   (222 observations deleted due to missingness)
## Multiple R-squared:  0.01545,    Adjusted R-squared:  0.003658 
## F-statistic:  1.31 on 2 and 167 DF,  p-value: 0.2725
bikini2_lm2<-lm(sold_count~category_sold+price, data=bikini2_daily_tr)
summary(bikini2_lm2)
## 
## Call:
## lm(formula = sold_count ~ category_sold + price, data = bikini2_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.575  -8.500  -0.676   8.639  44.648 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.266e+02  4.183e+01   3.027  0.00286 ** 
## category_sold  9.154e-03  6.948e-04  13.173  < 2e-16 ***
## price         -1.938e+00  7.044e-01  -2.751  0.00660 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.46 on 167 degrees of freedom
##   (222 observations deleted due to missingness)
## Multiple R-squared:  0.5116, Adjusted R-squared:  0.5058 
## F-statistic: 87.48 on 2 and 167 DF,  p-value: < 2.2e-16
bikini2_lm3<-lm(sold_count~favored_count+basket_count, data=bikini2_daily_tr)
summary(bikini2_lm3)
## 
## Call:
## lm(formula = sold_count ~ favored_count + basket_count, data = bikini2_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.712  -0.063   0.159   0.159  35.205 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.159138   0.356595  -0.446    0.656    
## favored_count -0.025855   0.002303 -11.225   <2e-16 ***
## basket_count   0.222054   0.005072  43.783   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.785 on 389 degrees of freedom
## Multiple R-squared:  0.9219, Adjusted R-squared:  0.9215 
## F-statistic:  2294 on 2 and 389 DF,  p-value: < 2.2e-16
bikini2_daily_te$forecastlm1<-predict(bikini2_lm1,bikini2_daily_te)
bikini2_daily_te$forecastlm2<-predict(bikini2_lm2,bikini2_daily_te)
bikini2_daily_te$forecastlm3<-predict(bikini2_lm3,bikini2_daily_te)

error_test(bikini2_daily_te$sold_count,bikini2_daily_te$forecastlm1)
##    n mean      sd      bias      mape      mad     wmape       MPE
## 1 11   55 16.2911 0.4185357 0.3798232 23.01947 0.4185357 0.3798232
error_test(bikini2_daily_te$sold_count,bikini2_daily_te$forecastlm2)
##    n mean      sd       bias      mape      mad     wmape       MPE
## 1 11   55 16.2911 -0.1482096 0.3908551 20.80109 0.3782016 -0.242806
error_test(bikini2_daily_te$sold_count,bikini2_daily_te$forecastlm3)
##    n mean      sd      bias     mape      mad     wmape       MPE
## 1 11   55 16.2911 0.1197114 0.127294 7.265821 0.1321058 0.1088184

Model 3 yielded a WMAPE of about 13%, much lower than that of the ARIMA model. Favored count and basket count will therefore be added as regressors.

ARIMAX Model

bikini2_xreg1<-cbind(bikini2_daily_tr$favored_count,
                  bikini2_daily_tr$basket_count)
bikini2_xreg2<-cbind(bikini2_daily_te$favored_count,
                  bikini2_daily_te$basket_count)

bikini2_arimax<-Arima(bikini2_ts_daily,xreg=as.matrix(bikini2_xreg1),order=c(2,0,2))
bikini2_daily_te$forecastarimax<-forecast(bikini2_arimax,xreg=as.matrix(bikini2_xreg2))$mean
error_test(bikini2_daily_te$sold_count,bikini2_daily_te$forecastarimax)
##    n mean      sd        bias      mape      mad     wmape         MPE
## 1 11   55 16.2911 0.005079662 0.1186635 6.154088 0.1118925 -0.02147385

With the addition of regressors the WMAPE decreased to about 11%.

ggplot(bikini2_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecastarimax,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above shows the forecasts of the final model, which track the actual values well.

Trendyolmilla Tayt

Decomposition

tayt<-subset(alldata, alldata$product_content_id==31515569)
#tayt_xts<-xts(tayt,order.by=tayt$event_date)
tayt_daily_tr<-tayt[tayt$event_date<"2021-06-21"]
tayt_daily_te<-tayt[tayt$event_date>="2021-06-21"]
plot(tayt$sold_count, type='l')

The plot above shows the sales volumes for Trendyolmilla Tayt. Tayt has been sold throughout the whole year, with spikes on certain days.

tayt_ts_daily<-ts(tayt_daily_tr$sold_count, freq=7)
tsdisplay(tayt_ts_daily)

Necessary data manipulations are made to obtain the daily series. The ACF shows decaying correlation at lags 1-3 and another significant correlation around lag 14; the PACF shows a significant correlation only at lag 1. This will be examined further after decomposition.

tayt_daily_additive<-decompose(tayt_ts_daily, type ="additive")
tayt_daily_multiplicative<-decompose(tayt_ts_daily, type ="multiplicative")
plot(tayt_daily_additive)

plot(tayt_daily_multiplicative)

The plots above show the trend, seasonal, and random components of the additive and multiplicative decompositions. There are spikes on certain days in the additive decomposition and the constant-variance assumption is violated; the variance of the multiplicative decomposition is higher still. Stationarity tests on the random components are used to decide which decomposition to carry forward.

test1=ur.kpss(tayt_daily_additive$random, use.lag="7")
summary(test1)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.0126 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
test2=ur.kpss(tayt_daily_multiplicative$random, use.lag="7")
summary(test2)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 7 lags. 
## 
## Value of test-statistic is: 0.1298 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The additive decomposition gives the smaller test statistic and is chosen for further examination.

Since there are not enough data points, decomposition cannot be performed at weekly or monthly levels, but aggregated weekly and monthly plots can be seen below.

plot(tayt_daily_additive$seasonal[1:30], type='l')

plot(tayt_daily_additive$trend, type='l')

Weekly

tayt_xts<-xts(tayt_daily_tr$sold_count, order.by=as.Date(tayt_daily_tr$event_date,"%d.%m.%Y"))
weekly<-endpoints(tayt_xts,on="weeks")
tayt_xts_weekly<-period.apply(tayt_xts,INDEX=weekly,FUN=mean)
plot(tayt_xts_weekly)

Monthly

tayt_xts<-xts(tayt_daily_tr$sold_count, order.by=as.Date(tayt_daily_tr$event_date,"%d.%m.%Y"))
monthly<-endpoints(tayt_xts,on="months")
tayt_xts_monthly<-period.apply(tayt_xts,INDEX=monthly,FUN=mean)
plot(tayt_xts_monthly)

It can be seen that sales volumes of tayt increased in November, influenced by the discount days flagged in our project to improve forecast results.

ARIMA Model

tsdisplay(tayt_daily_additive$random, na.action = na.omit)

According to the ACF plot, there is significant negative correlation at lag 4, so the moving-average order is set to 4. The PACF drops off after lag 1, suggesting an AR order of 1.

tayt_model<-arima(tayt_daily_additive$random, order = c(1,0,4))
AIC(tayt_model)
## [1] 5838.382
tayt_model2<-arima(tayt_daily_additive$random, order = c(1,0,3))
AIC(tayt_model2)
## [1] 5852.143
tayt_model3<-arima(tayt_daily_additive$random, order = c(1,0,2))
AIC(tayt_model3)
## [1] 5953.414

When adjacent orders are also checked, model 1 with order (1,0,4) still has the lowest AIC, so it is chosen.
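The manual comparison above can be generalized by looping over candidate orders; a sketch (the candidate orders shown are illustrative, and tayt_daily_additive comes from the chunks above):

```r
# Sketch: compare several candidate ARIMA orders by AIC in one pass.
candidates <- list(c(1,0,4), c(1,0,3), c(1,0,2), c(2,0,4))
aics <- sapply(candidates, function(o)
  AIC(arima(tayt_daily_additive$random, order = o)))
names(aics) <- sapply(candidates, paste, collapse = ",")
sort(aics)  # lowest AIC first
```

This makes it easy to widen the search around a promising order without repeating fit-and-print blocks.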

Fitting the model

model_forecast_error <- predict(tayt_model, n.ahead = 11)$pred
last_trend_value <-tail(tayt_daily_additive$trend[!is.na(tayt_daily_additive$trend)],1)
seasonality=tayt_daily_additive$seasonal[295:305]
tayt_arima_forecast=model_forecast_error+seasonality+last_trend_value

tayt_daily_te$forecast<-as.numeric(tayt_arima_forecast)


ggplot(tayt_daily_te ,aes(x=event_date)) +
        geom_line(aes(y=sold_count,color='actual')) + 
        geom_line(aes(y=forecast,color='fitted')) + labs(title = "Actual vs Forecasted Sales", 
       x = "Time",
       y = "Sales"
       )+
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The plot above shows actual and predicted values for the 11-day test window. The ARIMA(1,0,4) model built on the decomposed random component produces peaks and dips that do not clearly track the actual sales.

error_test(tayt_daily_te$sold_count,tayt_daily_te$forecast)
##    n     mean       sd        bias      mape      mad     wmape        MPE
## 1 11 293.3636 82.88941 -0.09614783 0.5779659 153.5975 0.5235738 -0.1245216

The WMAPE (about 52%) is very high in this case as well, so regressors are investigated next.

Regressor Search

tayt_lm1<-lm(sold_count~visit_count+is.discount_days, data=tayt_daily_tr)
summary(tayt_lm1)
## 
## Call:
## lm(formula = sold_count ~ visit_count + is.discount_days, data = tayt_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3179.6  -386.1  -126.8   177.4  5935.0 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.513e+02  4.318e+01  17.400  < 2e-16 ***
## visit_count      -7.606e-03  2.083e-03  -3.652 0.000295 ***
## is.discount_days  4.193e+03  2.021e+02  20.746  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 743.4 on 389 degrees of freedom
## Multiple R-squared:  0.5267, Adjusted R-squared:  0.5242 
## F-statistic: 216.4 on 2 and 389 DF,  p-value: < 2.2e-16
tayt_lm2<-lm(sold_count~category_sold+price, data=tayt_daily_tr)
summary(tayt_lm2)
## 
## Call:
## lm(formula = sold_count ~ category_sold + price, data = tayt_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2726.2  -267.0   -17.8   282.4  5051.9 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3493.53040  277.79024   12.58   <2e-16 ***
## category_sold    0.33963    0.01494   22.74   <2e-16 ***
## price          -68.28039    5.62092  -12.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 680.3 on 389 degrees of freedom
## Multiple R-squared:  0.6036, Adjusted R-squared:  0.6016 
## F-statistic: 296.2 on 2 and 389 DF,  p-value: < 2.2e-16
tayt_lm3<-lm(sold_count~price+favored_count+category_sold, data=tayt_daily_tr)
summary(tayt_lm3)
## 
## Call:
## lm(formula = sold_count ~ price + favored_count + category_sold, 
##     data = tayt_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2743.7  -259.6   -48.5   243.8  5252.4 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3656.61096  282.50779  12.943  < 2e-16 ***
## price          -72.54566    5.80720 -12.492  < 2e-16 ***
## favored_count    0.05999    0.02271   2.642  0.00858 ** 
## category_sold    0.32884    0.01538  21.386  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 675.1 on 388 degrees of freedom
## Multiple R-squared:  0.6106, Adjusted R-squared:  0.6076 
## F-statistic: 202.8 on 3 and 388 DF,  p-value: < 2.2e-16
tayt_lm4<-lm(sold_count~favored_count+basket_count+visit_count+is.discount_days, data=tayt_daily_tr)
summary(tayt_lm4)
## 
## Call:
## lm(formula = sold_count ~ favored_count + basket_count + visit_count + 
##     is.discount_days, data = tayt_daily_tr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3437.6  -123.9   -24.3   136.8  5043.3 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.774e+01  3.945e+01   1.971   0.0495 *  
## favored_count     2.698e-02  2.104e-02   1.283   0.2004    
## basket_count      2.024e-01  9.078e-03  22.297  < 2e-16 ***
## visit_count      -1.542e-02  1.796e-03  -8.586 2.22e-16 ***
## is.discount_days  1.829e+03  1.669e+02  10.959  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 474.1 on 387 degrees of freedom
## Multiple R-squared:  0.8085, Adjusted R-squared:  0.8065 
## F-statistic: 408.5 on 4 and 387 DF,  p-value: < 2.2e-16
tayt_daily_te$forecastlm1<-predict(tayt_lm1,tayt_daily_te)
tayt_daily_te$forecastlm2<-predict(tayt_lm2,tayt_daily_te)
tayt_daily_te$forecastlm3<-predict(tayt_lm3,tayt_daily_te)
tayt_daily_te$forecastlm4<-predict(tayt_lm4,tayt_daily_te)

error_test(tayt_daily_te$sold_count,tayt_daily_te$forecastlm1)
##    n     mean       sd      bias     mape      mad    wmape       MPE
## 1 11 293.3636 82.88941 -1.295053 1.447989 379.9214 1.295053 -1.447989
error_test(tayt_daily_te$sold_count,tayt_daily_te$forecastlm2)
##    n     mean       sd      bias     mape      mad    wmape       MPE
## 1 11 293.3636 82.88941 -4.682086 4.716494 1373.554 4.682086 -4.716494
error_test(tayt_daily_te$sold_count,tayt_daily_te$forecastlm3)
##    n     mean       sd      bias     mape     mad    wmape       MPE
## 1 11 293.3636 82.88941 -4.287784 4.292085 1257.88 4.287784 -4.292085
error_test(tayt_daily_te$sold_count,tayt_daily_te$forecastlm4)
##    n     mean       sd       bias      mape      mad     wmape        MPE
## 1 11 293.3636 82.88941 0.09496337 0.1889774 59.66235 0.2033734 0.05237606

Four different multiple linear regression models are built to understand which regressors explain the sales behavior best. In terms of both adjusted R-squared and WMAPE, the 4th model explains the sales behavior better than the others. Therefore, the 4th model's regressors will be chosen: favored count, basket count, visit count and discount days.
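The models above are compared with `error_test()`, which is defined earlier in the report. A minimal, self-contained sketch of the WMAPE column of that comparison (the metric driving the model choice here) could look like this in base R:

```r
# Weighted MAPE: total absolute error divided by total actual volume.
# This mirrors the wmape column reported by error_test() above (a sketch,
# not the report's exact implementation).
wmape <- function(actual, forecast) {
  sum(abs(actual - forecast)) / sum(abs(actual))
}

actual   <- c(100, 200, 300)
forecast <- c(110, 190, 330)
wmape(actual, forecast)  # 50 / 600 = 0.0833...
```

Unlike plain MAPE, WMAPE weights each day's error by its sales volume, so low-volume days with large percentage errors do not dominate the score.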

ARIMAX Model

tayt_xreg1<-cbind(tayt_daily_tr$favored_count,
                  tayt_daily_tr$basket_count,
                  tayt_daily_tr$visit_count,
                  tayt_daily_tr$is.discount_days)
tayt_xreg2<-cbind(tayt_daily_te$favored_count,
                  tayt_daily_te$basket_count,
                  tayt_daily_te$visit_count,
                  tayt_daily_te$is.discount_days)

tayt_arimax<-Arima(tayt_ts_daily,xreg=as.matrix(tayt_xreg1),order=c(1,0,4))
tayt_daily_te$forecastarimax<-forecast(tayt_arimax,xreg=as.matrix(tayt_xreg2))$mean
error_test(tayt_daily_te$sold_count,tayt_daily_te$forecastarimax)
##    n     mean       sd       bias      mape     mad     wmape         MPE
## 1 11 293.3636 82.88941 0.03536987 0.1820554 54.7053 0.1864761 -0.01095385

WMAPE is 0.19, slightly lower than that of the multiple linear regression model with the same regressors (0.20). The plot below shows the values forecasted with ARIMAX versus actual sales.
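The ARIMAX pattern used above is simply `Arima()` with an `xreg` matrix, where future regressor values must also be supplied at forecast time. A self-contained sketch on synthetic data (variable names are illustrative, assuming the `forecast` package):

```r
library(forecast)

# Synthetic series driven by a dummy regressor (e.g. discount days).
set.seed(1)
n <- 120
x <- rbinom(n, 1, 0.1)
y <- as.numeric(arima.sim(list(ar = 0.5), n = n)) + 5 * x

# Fit on the first 100 points; xreg must be a numeric matrix.
fit <- Arima(ts(y[1:100]), order = c(1, 0, 0), xreg = matrix(x[1:100]))

# Forecasting requires the regressor values for the forecast horizon.
fc <- forecast(fit, xreg = matrix(x[101:120]))
length(fc$mean)  # one forecast per future regressor row
```

This is the same structure as `tayt_arimax` above: the horizon is fixed by the number of rows in the test-period `xreg` matrix rather than by an `h` argument.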

ggplot(tayt_daily_te, aes(x = event_date)) +
  geom_line(aes(y = sold_count, color = 'actual')) +
  geom_line(aes(y = forecastarimax, color = 'fitted')) +
  labs(title = "Actual vs Forecasted Sales",
       x = "Time",
       y = "Sales") +
  theme(plot.title = element_text(color = "black", size = 20, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(size = 10, face = "bold", hjust = 0.5),
        plot.caption = element_text(face = "italic", hjust = 0))

The ARIMAX forecasts differ substantially from those of the plain ARIMA model. As can be seen from the plot above, the regressors improved the model.

Conclusion

In this study, forecasting models are constructed for various products sold on Trendyol. Each product's series was decomposed into its seasonal and trend components; after deciding on the appropriate decomposition technique, the random component was evaluated through its autocorrelation and partial autocorrelation plots to choose the best ARIMA model. Potential regressors were then investigated and added to the models according to their WMAPE values. Overall, the models performed well and can be used to forecast upcoming sales volumes. Some of the assumptions made, such as taking the last trend value and holding it constant, reduced the overall performance of the models. Also, since we used the regressor values from the test set instead of predicting them, error rates of real-time forecasts would be higher than the values found in this study. If future product prices were available, more precise forecasts might be achieved. Furthermore, because the data had a large number of missing values, using such variables as regressors was not feasible. We have not used other sources like Google Trends to back up our forecasts, since we all agreed that sometimes the best data to use is simply the basic facts; using such resources may nevertheless yield better results.
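As noted above, a real-time forecast would require the regressors themselves to be forecast before feeding them into the ARIMAX model. A minimal two-stage sketch of that idea on synthetic data (names are illustrative, assuming the `forecast` package):

```r
library(forecast)

# Synthetic regressor (e.g. visit_count) and a target driven by it.
set.seed(2)
n <- 110
x <- as.numeric(arima.sim(list(ar = 0.7), n = n)) + 10
y <- 2 * x + rnorm(n)

x_tr <- ts(x[1:100])
y_tr <- ts(y[1:100])

# Stage 1: forecast the regressor itself for the horizon.
x_fc <- forecast(auto.arima(x_tr), h = 10)$mean

# Stage 2: feed the forecasted regressor into the ARIMAX model.
fit  <- Arima(y_tr, order = c(1, 0, 0), xreg = matrix(x_tr))
y_fc <- forecast(fit, xreg = matrix(x_fc))$mean
```

Forecast errors from the first stage propagate into the second, which is why the error rates reported in this study, computed with actual test-set regressors, are an optimistic lower bound for real-time performance.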